-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update REPL.fuzzyscore to use string distance #50412
Conversation
@tecosaur I've used your implementation as suggested in the issue. What's your name so I can attribute? |
CI failure is a spurious failure in 32-bit Windows only (as is tradition) |
IIUC, this is copied from https://github.com/tecosaur/DataToolkitBase.jl/blob/a0aa954c0f1aa3b7578c92021e7017987459fb6f/src/model/utils.jl#L43-L44, which was added by https://github.com/tecosaur/DataToolkitBase.jl/commit/8be9e3ce6333f4b0877f5ed62bff6de87a5a4c07.patch
|
Squashed with attribution as suggested, so that the person clicking "squash and merge" don't have to remember adding attribution to the commit. The commit content is identical to that already approved. |
I'm happy with this attribution 🙂 (glad my work was useful Jacob!), but for the record I also wouldn't be fussed if you left it out. My implementation is a variation on the StringDistances.jl implementation, which is itself a reimplementation of a C# implementation, which is itself a variation on some other implementation, etc. — it's all rather muddy. I've also been thinking about whether I can introduce a half-penalty capitalisation transition, which could also be good here if I can work it out, but we can deal with that if/when it actually happens. |
This code then may be claiming copyrights under https://www.codeproject.com/info/cpol10.aspx, which the FSF does not consider compatible with the MIT license that people have subsequently attempted to apply to it and cannot be linked against GPL code. (https://www.gnu.org/licenses/license-list.html#cpol) https://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm-2 |
Where did https://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm-2 come from? The C# implementation I was talking about is http://blog.softwx.net/2015/01/optimizing-damerau-levenshtein_15.html, which is an implementation of one of the algorithms listed on the Wikipedia page (https://en.wikipedia.org/wiki/Damerau–Levenshtein_distance#Distance_with_adjacent_transpositions), which seems to come from an ACM summary article (https://dl.acm.org/doi/10.1145/1963190.1963191). This is what I'm referring to by "it's all rather muddy". My suspicion is that this code actually isn't copyrightable, since at the end of the day it's a rather straightforward implementation of a very well-explored algorithm. |
Looking closer at the original codeproject, I found the original code there, and I think the code in the blogpost is sufficiently different from it as to be merely "inspired by" and not "based on" that code. They appear to effectively implement a different pseudo-code, since the blog optimized and reorganized it in a way that made the result quite different from the original. We should add a second line to the commit message to satisfy the licensing terms of the StringDistances.jl package then also:
And it should be good to merge now. We could add one more line for the blog post (which is also licensed as MIT), but the author there appears to have assigned the copyright in that license over to "anonymous". |
(note that these co-authored-by lines are not strictly required for licensing, since both authors have already been added as copyright holders to the project as a whole via their past commits: https://github.com/JuliaLang/julia/blob/master/LICENSE.md) |
I'm not sure about that being the "original code", since that "original code" is from 2012, but the summary paper with the algorithm is from 2011, and whatever the source is in that summary paper, presumably is pre-2011 and likely came with an implementation. |
Oh also, I referenced the StringDistances.jl implementation, that C# implementation + description in that blog post, and the Wikipedia article when writing my version. So... more mud? 😅. I'm still on "I don't think this is copyrightable". |
Sorry, I meant original code with respect to the codeproject implementation. In comparison to the blog post implementation then, they don't look to me like they share ancestry, other than a comment in the blog that "this is based on Sten Hjelmqvist's" algorithm at that codeproject link. It seems to me now that the comment was intended to credit him merely for researching into the algorithm, not the actual implementation.
In the US, that question hasn't really been settled yet, c.f. https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_Inc. (other jurisdictions may have different interpretations here) |
That is all fine. You cannot copyright an algorithm, and you can even copy directly from multiple sources as long as the licenses are all compatible (all are MIT in this case). Aside: I find it amusing that the licensing question is happening on a PR about string distances, since that is quite related to what licensing itself is about measuring. |
What needs to be done to push this through? From reading the comments it looks like all is in order. |
The old heuristics were not particularly helpful. These new heuristics should be easier to reason about, since the score is between 0 and 1, and also yield much more intuitive results. Co-authored-by: TEC <git@tecosaur.net> Co-authored-by: matthieugomez <gomez.matthieu@gmail.com>
Bump. This is now rebased, and I added Matthieu Gomez as a co-author in the commit. Should be good to merge if CI passes, I think |
The old heuristics were not particularly helpful. These new heuristics should be easier to reason about, since the score is between 0 and 1, and also yield much more intuitive results.
Closes #49466
See #49562